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Abstract 




We propose a restricted collapsed draw (RCD) 
sampler, a general Markov chain Monte Carlo 
sampler of simultaneous draws from a hierar- 
chical Chinese restaurant process (HCRP) with 
restriction. Models that require simultaneous 
draws from a hierarchical Dirichlet process with 
restriction, such as infinite Hidden markov mod- 
els (iHMM), were difficult to enjoy benefits of 
the HCRP due to combinatorial explosion in 
calculating distributions of coupled draws. By 
constructing a proposal of seating arrangements 
(partitioning) and stochastically accepts the pro- 
posal by the Metropolis-Hastings algorithm, the 
RCD sampler makes accurate sampling for com- 
plex combination of draws while retaining effi- 
ciency of HCRP representation. Based on the 
RCD sampler, we developed a series of sophis- 
ticated sampling algorithms for iHMMs, includ- 
ing blocked Gibbs sampling, beam sampling, and 
split-merge sampling, that outperformed conven- 
tional iHMM samplers in experiments. 

1 Introduction 

Existing sampling algorithms for infinite hidden Markov 
models (iHMMs, also known as the hierarchical Dirichlet 
process HMMs) [??] do not use a hierarchical Chinese 
restaurant process (HCRP) [?], which is a way of repre- 
senting the predictive distribution of a hierarchical Dirich- 
let process (HDP) by collapsing, i.e. integrating out, the un- 
derlying distributions of the Dirichlet process (DP). While 
an HCRP representation provides efficient sampling for 
many other models based on an HDP [??] through reduc- 
ing the dimension of sampling space, it has been consid- 
ered rather "awkward" [?] to use an HCRP for iHMMs, 
due to the difficulty in handling coupling between random 
variables. In the simplest case, consider step-wise Gibbs 
sampling from an iHMM defined as 7Tfc ~ DP(/3, a>o) and 



Figure 1 : Step-wise Gibbs sampling in iHMM. Since the 
Dirichlet process prior is posed on transitions in iHMM, 
resampling xi involves taking two transitions, Xi-\ — > Xi 
and Xi — > sci+i, simultaneously. In this case, we consider 
distribution of two draws {x\ , x' i+1 ) with restriction that the 
draws are consistent with remaining sequence, i.e., x' i+1 = 

Xi+l. 

(3 ~ GEM(7). Given xi,..., Xi-i,Xi+i, . . . , xt, resam- 
pling hidden state Xi at time step i actually consists of two 
draws (Figure 1), x\ ~ Tt Xi _ x and x' i+1 ~ n x >,, under the 
restriction {x' i: x' i+l ) G C that these draws are consistent 
with the following sequence, i.e., C = {(x' i ,x' i+1 )\x' i+1 = 
Xi + i}. Under the HCRP, the two draws are coupled even 
if Xi-i 7^ x'i, because distributions 7T Xi _ 1 , Tv x r as well as 
the base measure (3 are integrated out in an HCRP, and cou- 
pling complicates sampling from the restricted distribution. 

To generalize, the main part of the difficulty is to obtain 
a sample from a restricted joint distribution of simulta- 
neous draws from collapsed distributions, which we call 
restricted collapsed draw (RCD). Consider resampling L 
draws simultaneously, x = (xj 1 i 1 , . . . , Xj L i L ), from the 
respective restaurants j = (ji, . . . when we have a 
restriction C such that x G C. Step-wise Gibbs sampling 
from iHMM can be fitted into RCD with L = 2 by allowing 



restaurant index ji to be dependent on the preceding draw 

Xj 1 i 1 . 

In this paper, we point out that it is not enough to consider 
the distribution of draws. Since the HCRP introduces an 
additional set of latent variables s that accounts for the seat- 
ing arrangements of the restaurants, we have to compute an 
exact distribution of s as well, under the restriction. We 
want to perform sampling from the following conditional 
distribution, 

p(x,a\C) = ^-l[xeC]p(x,a) , (1) 

where Zq is a normalization constant and I is the indicator 
function, whose value is 1 if the condition is true and oth- 
erwise. Although non-restricted probability p(x, s) can be 
easily calculated for a given x and s, calculating the nor- 
malization constant Zc leads to a combinatorial explosion 
in terms of L. 

To solve this issue, we propose the restricted collapsed 
draw (RCD) sampler, which provides accurate distribu- 
tions of simultaneous draws and seating arrangements from 
HCRP. The RCD sampler constructs a proposal of seating 
arrangements using a given proposal of draws, and the pair 
of proposals are stochastically accepted by the Metropolis - 
Hastings algorithm [?]. Since the RCD sampler can han- 
dle any combination of restricted collapsed draws simul- 
taneously, we were able to develop a series of sampling 
method for HCRP-HMM, including a blocked collapsed 
Gibbs sampler, a collapsed beam sampler, and a split- 
merge sampler for HCRP-HMM. Through experiments we 
found that our collapsed samplers outperformed their non- 
collapsed counterparts. 

2 HCRP representation for iHMM 
2.1 Infinite HMM 

An infinite hidden Markov model (iHMM) [??] is defined 
over the following HDP: 

G ~DP( 7 ,ff) G fc ~DP(a ,G ) , (2) 

To see the relation of this HDP to the transition matrix 7r, 
consider the explicit representation of parameters: 

oo oo 

Go = Pk'4>k> G k = ^2 Kk'k^k' , (3) 

k' = l k' = l 

where transition probability TVk is given as 7Tfe ~ 
DP(ao,/3), /3 ~ GEM(7) is the stick-breaking construc- 
tion of DPs [?], and (f> k ~ H. 

A formal definition for the HDP based on this representa- 
tion is: 

/3| 7 ~GEM( 7 ) 7r>o,/3~DP(a ,/3) (4) 

Xji\(nk)kLi ~ "Ki <Pk ~ H y ]t \x ]t ~ F((/) X]i ) , (5) 



Given an HDP and initial state xq, we can construct an 
infinite HMM by extracting a sequence of draws Xi as 
%i = x Xi l i, and corresponding observations yi = y Xi i. 
Figure 2 shows a graphical representation of the iHMM. 

2.2 HCRP-HMM 

As another way of representing HDP in iHMM (Eq. 2), we 
introduce a hierarchical Chinese restaurant process (HCRP, 
also known as the Chinese restaurant franchise), which 
does not need to sample the transition distribution ir and 
its base measure (3 in Eq. (4): 

M7~CRP(7) t ii |ao~CRP(ao) (6) 

Xji = kjt^ (7) 

<t>k ~ H Vji\xji, 4> ~ F{(j> Xji ) . (8) 

Using the Chinese restaurant metaphor, we say that cus- 
tomer i of restaurant j sits at table tji, which has a dish of 
an index kj tji . 

To understand connection between HDP and HCRP, con- 
sider a finite model of grouped observations Xji, in which 
each group j choose a subset of M mixture components 
from a model-wide set of K mixture components: 



/3| 7 ~Dir( 7 /#,...,7/#) 
~j |a ~ Dir(a /M, . . . , a /M) tji\rj 



(9) 

■l">3 (io) 

As K — > oo and M — > oo, the limit of this model is HCRP; 
hence the infinite limit of this model is also HDP. Equa- 
tion (6) is derived by taking the infinite limit of K and M 
after integrating out (3 and r in Eqs. (9) and (10). The 
distribution ttj in Eq. 4 can be derived from Tj and kj as 
follows: 
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To consider sampling of Xji using HCRP (Eqs. 7 and 8), 
we use count notation rijtk as the number of customers in 
restaurant j at table t serving the dish of the fc-th entry, and 
rrijk as the number of tables in restaurants the j serving the 
dish of the fc-th entry. We also use dots for marginal counts 
(e.g., m.fe = J2j m jk)- Then, we sample table index tji 
from the following distribution: 
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When tji = t new 



(i.e., the customer sits at a new table), we 



need to sample fcj t nci U , whose distribution is: 
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These variables determine the new sample Xji = kjt }i . 
Since Xji does not uniquely determine the state of the 
HCRP model, we need to keep latent variables tji and 




Figure 2: Graphical Representation of iHMM. 



kj t for subsequent sampling. We will denote = 
(tji, tj2, ■ ■ •) as the seating arrangement in restaurant j, 
s(°) = (feu, fci2, ■ • ■ , feij ■ • ■) as the seating arrangement 
in the root restaurant, and s as the collection of all seating 
arrangements, corresponding to the sampled model state. 
In Bayesian inference based on sampling, we need a proce- 
dure to sample the latent variables, given the value of new 
draw Xji and the seating arrangements for other draws s, 
which is called as addCustomer. 

Construction of HCRP-HMM is the same as iHMM, i.e., 
extracting a sequence of draws x ; given x as Xi = x Xi _^, 
and corresponding observations = y Xii . 

3 Restricted Collapsed Draw Sampler 

What we want is a sampling algorithm for HCRP-HMM. 
As described in the Introduction, the problem can be re- 
duced to an algorithm for sampling from p(x,s\C), i.e., 
the distribution of restricted collapsed draw with seating 
arrangements (Eq. 1). 

Our idea is to apply the Metropolis-Hastings algorithm 
[?] to the seating arrangements, which stochastically ac- 
cepts the proposal distribution of seating arrangements. Al- 
though it is hard to directly give proposal distribution q(s) 
of seating arrangements, our method constructs q(s) by 
combining q x (x) with q s (s\x), another proposal of seat- 
ing arrangements given the proposed draws, which is based 
on the addCustomer procedure that is standardly used in 
Gibbs sampling of HCRP. 

3.1 Overall sampling 

The Metropolis-Hastings algorithm provides a way of con- 
structing an MCMC sampler using unnormalized probabil- 
ity value p(z). After sampling z* from proposal distribu- 
tion q(z*\z), the algorithm computes acceptance probabil- 
ity R: 

p(z*) q(z° ld \z*) 



R = 



1. 



p(z old ) q(z*\z° ld ) 



Then the result z new = z* with probability R, and 
z new _ z oid otherwise. Repeating this process consti- 
tutes an MCMC sampler from required distrubution p(z) oc 
P(*)> 

Within the context of HCRP, sample space z consists of 
draws x and seating arrangement s. From Eq. (1), we can 
use the non-restricted probability of draws p(x, s) as un- 
normalized probability value p(z), but it is not easy to pro- 
vide a proposal for joint distribution q(x* , s*). 

Our idea is to factorize the proposal distribution as: 

q(x*,s*\s ) = q x (x*\s ) ■ q s (s*\x*, s ) . (17) 

First factor q x is the proposal distribution of the draws. 
Second factor q s is the proposal distribution of the seat- 
ing arrangements given the proposal draws. We use the 
result of the addCustomer procedure, which stochastically 
updates the seating arrangements, as the proposal distribu- 
tion of the seating arrangements. 

3.2 Computing Factors 

The following describes each factor in R and its computa- 
tion. 

True Probability p(x, s) in Eq. (1) is the joint probabil- 
ity of all draws Xjf. 



p(x,s)=l[p( S ^)-p( S W) 



(18) 



where 



* fa)) = rST^-""'-nn»,0 (19) 

1 (7 + m..) 



l K -l[T(m. k ) , (20) 



(16) 



and r is the Gamma function. This is the product of the 
probabilities of seating arrangements (Eqs. 12 to 15) for 
each customer. 

In practice, we only need to calculate probabilities that ac- 
count for the change in seating from sq, because the prob- 



ability for unchanged customers is cancelled out through 
reducing the fraction in R. Let So be the seating arrange- 
ment for the unchanged customers, then 

P(«*) = P(s*\so) 
p(s old ) p(s° ld \s ) 

In fact, p(x*,s*\so) is easily calculated along with add- 
Customer operations: 

p{x*,s*\s ) = 

P(X*,8*\S0)P(X2> S 2\ 8 1) ■ ■ ■ P( x L, s *\ s l-l) ■ ( 22 ) 

Here, p(xg, 8t\se-i) is probability p(x j{ie , t jeie \j e , s^_i) 
of obtaining seating arrangement si as a result of drawing 
a sample from restaurant j: 

p(Xji = k,tji = t\j, s) = — 
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(23) 



The same applies to the calculation of p(s old \so), which 
can be done along with removeCustomer operations. 

Proposal Distribution of Draws q(x) can be anything 
as long as it is ergodic within restriction C. To increase 
the acceptance probability, however, it is preferable for the 
proposal distribution to be close to the true distribution. 
We suggest that a good starting point would be to use a 
joint distribution composed of the predictive distributions 
of each draw, as has been done in the approximated Gibbs 
sampler [?]: 

L 

q x {x)=l[x eC]Y[p{x t \s ) . (24) 

i=l 

We will again discuss the proposal distribution of draws for 
the HCRP-HMM case in Section 4. 

Proposal Distribution of Seating Arrangements 

q s (s*\x, So), is the product of the probabilities for each 
operation of adding a customer: 

q s (s*\x*,s ) = q s (s* 1 \x* 1 ,s )q s (s* 2 \x* 2 ,s* 1 ) 

••• q s {s*\x},s* t _ 1 ) . (25) 
Here, q s (s e \x e , at-i) = p(t 



ji,8i-{), i.e., the 
probability of obtaining seating arrangement se as a result 
of the addCustomer(a; Jf i £ , jg, s^_i) operation. 



p(tji = t\Xji = k,j, s) 
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m.. +7 



rij.k + a Q 
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(26) 



(27) 



3.3 Simplification 

Paying attention to the fact that both Eqs. (23) and (27) are 
calculated along a series of addCustomer calls, we intro- 
duce factors 

p(xl,a t \8t-i) „ ld _ 



q s (si\x* e ,s e -i) q s {st\xf d , sji-i) 

to simplify the calculation of R as: 

p{s*) q s (s otd \x° ld ,Sg)q x (x old ) 
p(s° td ) q s (s*\x*,S ) q x (x*) 
r(x*,s*\s ) q{x old ) 



(28) 



R = min I 1, 
= min ( 1, 



where 

r(x*,s*\s ) 



(x old ,s old \s ) q{x*) 



p{x*,s*\s 



(29) 



q s (s*\x*,s ) 

p(xl,sx\s ) p(x* L ,s L \s L -i) 



q s (si\xl,s ) q s (s L \x* L ,s L -i) 
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(30) 



Surprisingly, assigning Eqs. (23) and (27) into Eq. (28) re- 



veals that r\ is equal to p(xj 



z \ l s l-i)' i- e -' prob- 



ability of new customer Xj { i e at restaurant j# eating dish 
x i '■ 



p( Xji = k\s) = ^±2_ (3i) 
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m.k+'T 



(32) 



p{ Xji = k new \s) 

rij.. + Qo 

In other words, calculation of the accept ratio does not use 
tji (the table index of each customer), despite the fact that 
the values of tji are being proposed; tji will indirectly af- 
fect the accept ratio by changing subsequent draw prob- 
abilities p(x£ + i\si),p(xg +2 \sg +1 ), . . . through modifying 
rijtk and rrijk, i.e., the number of customers and tables. 

It is now clear that, as done in some previous work [?], we 
can save storage space by using an alternative representa- 
tion for seating arrangements s, in which the table indices 
of each customer tji are forgotten but only the numbers of 
customers rijt., kjt and rrijk are retained. The only remain- 
ing reference to tji in the removeCustomer procedure can 
be safely replaced by sampling. 

However, it should be noted that we have to revert to orig- 
inal seating assignment s old whenever the proposal is re- 
jected. Putting the old draws x old back into So by using 
the addCustomer procedure again will lead sampling to 
an incorrect distribution of seating assignments, and con- 
sequently, an incorrect distribution of draws. 

Algorithm 1 is the one we propose othat obtains new sam- 
ples x new drawn simultaneously from restaurants indexed 
by j and associated seating arrangement s new , given pre- 
vious samples x old and s old . 



Algorithm 1 MH-RCDSampler(j, x otd , s old ): Metropolis-Hastings sampler for restricted collapsed draw 
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for i = L downto 1 do 



-.old 

'£-1 — 
^old 



removeCustomer^ , jg, ) 

. ^(ry.old a old \ 



-.old 
'0 



end for 

s o = s o = s o 
x* ~ q x (x; s ) 
for £ = ltoL do 

r e = P( x e> s e-i) 

s* t = addCustomer(a;|, ji, s^^) 

end for 

S* = S* r 



13: R = min 1, 



q x {s old ) 
q x (s*) 2J[ r 



1 1 „old 



{ Remove customers for x° 



id 



old 



sequentially from s } 



14: return (x r 



(x*, s*) with probability R 
(x old ,s old ) otherwise. 



{ Calculate factors for accept ratio } 

{ Draw x* from proposal distribution q(x) of draws. } 

{ Calculate factors for accept ratio } 
{ Add customers for x\ , . . . , x* m sequentially to Sq } 

{ Obtain proposal seating s* } 
{ Calculate acceptance probability } 

{ Accept/reject proposed sample } 



The first half of this sampler is similar to a sampler 
for a single draw; it consists of removing old cus- 
tomers (line 3), choosing a new sample (line 7), and 
adding the customers again (line 10). The main differ- 
ence is that there are L times of iteration for each call 
removeCustomer/addCustomer, and the calculation of r, 
which is later used for acceptance probability R. 

4 Gibbs sampler for HCRP-HMM 

This section describes a series of samplers for HCRP- 
HMM. First, we present the step-wise Gibbs sampler as the 
simplest example. After that, we describe a blocked Gibbs 
sampler using a forward-backward algorithm. We also ex- 
plain the HCRP version of the beam sampler [?] as well as 
the split-merge sampler [?] for iHMM. 

4.1 Step- wise Gibbs sampler 

A step-wise Gibbs sampler for HCRP-HMM is easily con- 
structed using an RCD sampler (Algorithm 5 in the Ap- 
pendix describes one Gibbs sweep). We slightly modified 
the proposal distribution q(xt) from that suggested in Sec- 
tion 3.2, in order to ensure that Xt+i is proposed with non- 
zero probability even when no table in Sq serves dish x t +\'. 

q x (x t ) oc fp(xt|4 Xt -°) + (V— ^7— tW> 

V V( Q! o + n Xt _ 1 ..)(7-l-m..)y + . 

• K^+il4 at) )-Kytl4 Xt) ) ■ (33) 

4.2 Blocked Gibbs sampler 

We can construct an alternate sampling scheme under 
the framework of RCD sampler that resamples a block 



of hidden states simultaneously, based on the forward- 
backward sampler [?]. The idea is that we run the forward- 
backward sampler with a predictive transition distribution 
from HCRP-HMM, and use the result as a proposal of re- 
stricted collapsed draw. 

For iHMM, the forward-backward sampling algorithm [?] 
cannot be directly used, because the forward probability 
values for an infinite number of states have to be stored for 
each time step t [?]. This is not the case for HCRP-HMM, 
because predictive transition probability -fr from given seat- 
ing assignment Sq, which is given as Eqs. (31) and (32), 
only contains transition probability for finite number K of 
states plus one for k new . Thus we only need to store K + 1 
forward probability for each time step t. 

Result x of the forward-backward sampler, however, can- 
not be used directly as the proposal; the i-th state of the 
proposal x* is equal to x,i when x,i ^ k new , but we need 
to assign new state indices to x* whenever xi — k new . In 
particular, when k new has appeared W > 2 times, all ap- 
pearances of k new may refer either to the same new state, 
or to W different states, or to anything in between the two, 
in which some appearances of k new share a new state. 

To achieve this purpose, we prepare special CRP Q* that 
accounts for the previously unseen states, marked by k new 
in the result of the forward-backward sampler. Specifically, 
each table in Q* has a dish with an unused state index, and 
each appearance of k new is replaced with a draw from Q*. 
This construction ensures that every state sequence is pro- 
posed with a non-zero probability, and allows the proposal 
probability to be easily calculated. The concentration pa- 
rameter of Q* is set as equal to 7. To handle the case where 
some of the new states are equal to x tb+1 , i.e., index of the 
state that succeeds to the resampling block, we add to Q* 



an extra customer that correponds to x tb+1 when x tb+1 does 
not appear in So, 

Resulting proposal probability is: 

q x (x*) = 

[ n^u-is* ■ n-^w ) ■ n p^iq*) > 

\e=o 1=1 ) e-.xe=k n " w 

(34) 

where the first factor accounts for the forward probability 
of the sequence, and the second factor accounts for proba- 
bility of the new state assignment. 

Note also that, to make a sensible proposal distribution, we 
cannot resample the whole state sequence simultaneously. 
We need to divide the state sequence into several blocks, 
and resample each block given the other blocks. The size 
of a block affects efficiency, because blocks that are too 
large have lower accept probability, while with blocks that 
are too small, the algorithm has little advantage over step- 
wise Gibbs sampling. 

Algorithm 8 in the Appendix describes one sweep of a 
blocked Gibbs sampler for an HCRP-HMM. 

4.3 Beam sampling 

Beam sampling for HDP-HMM [?] is a sampling algo- 
rithm that uses slice sampling [?] for transition probability 
to extract a finite subset from the state space. Although the 
possible states are already finite in HCRP-HMM, the same 
technique may benefit sampling of HCRP-HMM by im- 
proving efficiency from the reduced number of states con- 
sidered during one sampling step. 

We just need replace the call to ForwardBackwadSampling 
in Algorithm 8 with the call to BeamSampling to use beam 
sampling with HCRP-HMM. A brief overview of the beam 
sampling is: 

1. Sample auxiliary variables u = (uq, ■ ■ ■ , Ul) as ug ~ 
Uniform(0, 71-^^), 

2. For I = 1, calculate forward probability 
q(x' e = k') using a slice of transition probability 

q(x' e = k') = F k >{y t )YjkK^k'k > ui-i)q(x! t _ x = k), 

3. For I = L, . . , , 1, sample the states x' t backwardly, 

i.e. p(x' e = k) oc l{^ x ' e+i k > ue). 

For details, refer to the original paper [?]. 

Some remarks may be needed for the calculation of g*, 
i.e., the proposal probability for the state sequence. Al- 
though beam sampling has a different proposal distribution 
from forward-backward sampling, we can use the same cal- 
culation of proposal probability used in acceptance proba- 
bility as that of forward-backward sampling. This is be- 



cause beam sampling satisfies the detailed balance equa- 
tion, which ensures that the ratio of proposal probability 
with beam sampling q %\ i " is always equal to the ratio of the 

Qslice 

probability obtained by forward-backward sampling -fra-. 



4.4 Split-Merge Sampling 

We can integrate the split-merge sampling algorithm, 
which is another sampling approach to Dirichlet process 
mixture models [?], into HCRP-HMM using the RCD sam- 
pler. A split-merge sampler makes a proposal move that 
tries to merge two mixture components into one, or to split 
a mixture component into two; the sampler then uses a 
Metropolis-Hastings step to stochastically accept the pro- 
posal. Based on the RCD framework, we can extend the 
split-merge sampler into HCRP, which can be applied to 
HCRP-HMM. Within the context of HMM, the sampler 
corresponds to merge two state indices into one, or to split 
a state index into two. 

Our implementation is based on an improved version of 
hte split-merge sampler, called the sequentially-allocated 
merge-split sampler [?], which produces a split proposal 
while sequentially allocating components in random order. 
To deal with temporal dependency in HMM, we identify 
fragments of state sequences to be resampled within the 
state sequence, and perform blocked Gibbs sampling for 
each fragment in random order. 

We added one important optimization to the split-merge 
sampling algorithm. Since a merge move is proposed much 
more frequently than a split move, and the move has a rel- 
atively low accept probability, it is beneficial if we have a 
way of determining whether a merge move is rejected or 
not earlier. Let us point out that, when proposal proba- 
bility for a merge move is calculated, the accept probabil- 
ity is monotonically decreasing. Consequently we sample 
R thr , the threshold of accept probability, at the beginning 
of the algorithm and stop further calculation when R be- 
comes less than R thr . Algorithm 9 in the Appendix is the 
split-merge sampling algorithm for HCRP-HMM. 

Split-merge sampling allows faster mixing when it is inter- 
leaved with other sampling strategies. We examine split- 
merge sampling with each of the samplers we have pre- 
sented in this paper. 



5 Experiments and Discussion 

This section presents two series of experiments, the first 
with small artificial sequences and the second with a se- 
quence of natural language words. 



5.1 Settings 

We put gamma prior Gamma(l, 1) on ao and 7, and 
sampled between every sweep using an auxiliary variable 
method [?] in all the experiments. We introduced HCRP 
as a prior of emission distributions as well, and its hyper- 
parameters were also sampled in the same way. 

The initial state sequence given to the sampler is the result 
of a particle filter with 100 particles. 

We measured autocorrelation time (ACT) to evaluate mix- 
ing. Given a sequence of values x — x\, X2, ■ ■ ■ , xt, its 
mean p and variance a 2 , ACT(x) are defined as follows: 



ACF t {x) 



1 



T-t 



(xi — n){xi+t — fx) (35) 
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Figure 3: Automaton that generates Sequence 2. Circles 
denote hidden states, and the same alphabet emissions are 
observed from states within an oval group. A dashed ar- 
row denotes transition with probability 0.8, a bold arrow 
denotes transition with probability 0.84, and a solid arrow 
denotes emission with probability 1/3. 



ACT(x) = - + }_^ACF t {x) 

t=i 



(36) 



Since with larger t, ACF^x) is expected to converge to 
zero, we used ACF^x) for t < 1000. 

For artificial sequence, we evaluated mutual information 
between the h t , hidden state used in sequence generation 
and xt, inferred states as follows: 



MI = ^2^2p(x,h)]og 



h x 



P{x,h) 
p(x)p(h) 



(37) 



For natural language text, the inferred model is evaluated 
by multiple runs of a particle filter on a given test sequence 
of length Tt e st- We specifically construct a particle filter 
with Z = 100 particles for each sampled model state s z , 
and evaluate likelihood | s z ) for each emission. Finally, 
we calculate the perplexity (the reciprocal geometric mean 
of the emission probabilities) of the test sequence: 



PPL = cxp (-J—Y / lo S hi) 



where 



Z^ 

z=l 



(38) 



(39) 



The samplers we chose for comparison are the step-wise 
Gibbs sampler with direct assignment representation [?], 
which uses stick-breaking for the root DP and CRP for the 
other DPs, the step-wise Gibbs sampler with stick-breaking 
construction, and the beam sampler with stick-breaking 
construction [?]. For fair comparison between different al- 
gorithms, we collected samples to evaluate the autocorre- 
lation time and perplexity on a CPU-time basis (excluding 
the time used in evaluation). All the algorithms were im- 
plemented with C++ and tested on machines with an Intel 
Xeon E5450 at 3 GHz. 



5.2 Artificial data 

The first series of experiments are performed with two 
small artificial sequences. Sequence 1 consists of repeat- 
ing sequence of symbols A-B-C-D-B-C-D-E-... for length 
T = 500, and we run the sampler 30 s for burn-in, and af- 
ter that, a model state is sampled every 2 s until a total of 
300 s is reached. Sequence 2 is generated from the simple 
finite state automaton in Figure 3 for length T = 2500, and 
we use 60 s for burn-in and total 600 s. We evaluated the 
mutual information between the inferred hidden states and 
the true hidden states. 

Figure 4 shows the distribution of mutual information for 
100 trials after 300 s. We can see that some of the samplers 
based on the proposed method achieved a better mutual in- 
formation compared to existing samplers. The improve- 
ment depends on the type of sequence and the samplers. 

For Sequence 1, we can see that split-merge sampling 
yields better results compared to other samplers. Although 
HMM with eight hidden states can completely predict the 
sequence, the samplers tend to be trapped in a local opti- 
mum with five states in the initial phase, because our se- 
lected prior of 7 poses a larger probability on a smaller 
number of hidden states, Detailed investigations (Figure 5) 
confirmed this analysis. 

For Sequence 2, on the other hand, blocked samplers 
worked very efficiently. Step-wise samplers generally 
worked poorly on the sequence, because the strong de- 
pendency on temporally adjacent states impedes mixing. 
Still, step-wise Gibbs sampler for HCRP-HMM outper- 
formed the beam sampler with the stick-breaking process. 
The blocked Gibbs sampler had inferior performance due 
to its heavy computation for a large number of states, but 
the beam sampler for HCRP-HMM was efficient and per- 
formed well. Combination with a small number of split- 
merge samplers increases the performance (more split- 
merge sampling leads to lower performance by occupying 



computational resource for the beam sampler). From aver- 
ages statistics of samplers (Table 1), we can see that (1) the 
increase of mutual information cannot be described only 
by the increase of the number of states; (2) The accept ratio 
for the Gibbs trial has a very high accept rate; (3) Split- 
merge samplers have a very low accept rate, but still make 
improvement for mutual information. 

5.3 Natural language text 

We also tested the samplers using a sequence of natural 
language words from Alice's Adventure in Wonderland. We 
converted the text to lower case, removed punctuation, and 
placed a special word EOS after every sentence to obtain 
a corpus with 28, 120 words; we kept the last 1,000 words 
for test corpus and learned on a sequence with length T = 
27120. We introduce a special word UNK (unknown) to 
replace every word that occurred only once, resulting in 
|S| = 1, 487 unique words in the text. We took 10,000 s 
for burn-in, and sampled a model state for every 120 s, until 
the total of 172,800 s. Table 2 summarize the averaged 
statistics for 18 trials. 

We found that step-wise sampling outperformed blocked 
sampling (including beam sampling). The reason for this 
may be the nature of the sequence, which has a lower tem- 
poral dependency. Blocked Gibbs sampling, in particular, 
consumes too much time for one sweep to be of any prac- 
tical use. We also found that split-merge sampling had a 
very low accept rate and thus made little contribution to the 
result. 

Yet, we can see the advantage of using HCRP represen- 
tation over stick-breaking representation. The direct as- 
signment (DA) algorithm showed a competitively good 
perplexity, reflecting the fact that DA uses stick-breaking 
for only the root DP and uses the CRP representation for 
the other DP. Though step-wise Gibbs sampling and its 
slice sampling version seems outperforming DA slightly, 
we need to collect more data to show that the difference is 
significant. At least, however, we can say that now many 
sampling algorithms are available for inference, and we can 
choose a suitable one depending on the nature of the se- 
quence. 

6 Conclusion and Future Work 

We have proposed a method of sampling directly from con- 
strained distributions of simultaneous draws from a hier- 
archical Chinese restaurant process (HCRP). We pointed 
out that, to obtain a correct sample distribution, the seat- 
ing arrangements (partitioning) must be correctly sampled 
for restricted collapsed draw, and we thus proposed apply- 
ing the Metropolis-Hastings algorithm to the seating ar- 
rangements. Our algorithm, called the Restricted Collapsed 
Draw (RCD) sampler, uses a naive sampler to provide a 



proposal distribution for seating arrangements. Based on 
the sampler, we developed various sampling algorithms 
for HDP-HMM based on HCRP representation, including 
blocked Gibbs sampling, beam sampling, and split-merge 
sampling. 

The applications of the RCD sampler, which is at the heart 
of our algorithms, are not limited to HCRP-HMM. The 
experimental results revealed that some of the proposed al- 
gorithms outperform existing sampling methods, indicating 
that the benefits of using a collapsed representation exceed 
the cost of rejecting proposals. 

The main contribution of this study is that it opens a way 
of developing more complex Bayesian models based on 
CRPs. Since the RCD sampler is simple, flexible, and 
independent of the particular structure of a hierarchy, it 
can be applied to any combination or hierarchical struc- 
ture of CRPs. Our future work includes using this algo- 
rithm to construct new Bayesian models based on hierar- 
chical CRPs, which are hard to implement using a non- 
collapsed representation. Planned work includes extending 
HDP-IOHMM [?] with a three-level hierarchical DP (e.g., 
the second level could correspond to actions, and the third 
level, to input symbols). 
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Figure 5: Distribution of mutual information for Sequence 1. X-axis shows mutual information and Y-axis shows fre- 
quency. Block size « 6 for HCRP-HMM Beam sampling. 



Table 1 : Experimental results for Sequence 2 



name 


Ml 

1V11 


/Av- 1 TrSlaieS 


TrSLaieS 


sees/sweep 


f -wt r\r\C 0/*POnt rnfa 

vJlDDS aCCcpL TaLC 


\/l o /~* /~* ^ t~vt" rota 

olvl dCCcpL IdLc 


HlJr-rlMlVl (DA) oOlDDS 


Z.yZ 


U.JZ / 


i A Q1 n 


f\ f\A A 
U.U44 






Ur\D LI \ /1 1\ /I /CD\ Doom 

rlDr-rlMM (or> ) rseam 


f\A 


f\ AA f\ 
U.04U 


14. 1 10 








HCRP-HMM SGibbs 


3.18 


0.719 


16.210 


0.026 


0.999666 




HCRP-HMM SGibbs +SM=2 


3.28 


0.615 


17.190 


0.030 


0.999632 


0.000631 


HCRP-HMM SGibbs +SM=13 


3.19 


0.493 


16.830 


0.044 


0.999619 


0.000650 


HCPvP-HMM SSlice 


2.82 


0.820 


15.950 


0.009 


0.999847 




HCPvP-HMM SSlice +SM=2 


2.86 


0.705 


17.330 


0.013 


0.999822 


0.000827 


HCPvP-HMM SSlice +SM=13 


2.59 


0.604 


16.400 


0.030 


0.999830 


0.000910 


HCRP-HMM BGibbs 


3.01 


0.317 


14.900 


0.206 


0.995525 




HCRP-HMM BGibbs +SM=2 


3.12 


0.513 


16.270 


0.237 


0.995135 


0.000985 


HCRP-HMM BGibbs +SM=13 


3.18 


0.637 


16.530 


0.265 


0.994715 


0.000715 


HCRP-HMM Beam 


3.21 


0.866 


15.180 


0.016 


0.997233 




HCRP-HMM Beam +SM=2 


3.37 


0.898 


16.910 


0.019 


0.996369 


0.000497 


HCRP-HMM Beam +SM=13 


3.28 


0.875 


17.070 


0.034 


0.996316 


0.000532 



DA: Direct Assignment SB: Stick-Breaking construction 

MI: Mutual Information ACT: Auto-correlation time, samples collected for every 0. 1 s 
#states: number of states SGibbs: step-wise Gibbs 

SSlice: step- wise Gibbs with slice sampling (beam sampling with block size=l) 
BGibbs: blocked Gibbs (block size w 8) 

Beam: beam sampling (block size ~ 8 for HCRP-HMM, T for stick-breaking) 
SM: Split-Merge sampler (+SM=rt denotes SM trials per Gibbs sweep) 



Table 2: Experiments on Natural language text 



Sampler 


Perplexity 


# states 


sec/sweep 


Gibbs accept rate 


SM accept rate 


HDP-HMM (DA) SGibbs 


134.22 


313.056 


10.017 






HDP-HMM (SB) SGibbs 


151.10 


242.389 


38.045 






HDP-HMM (SB) Beam 


178.59 


68.444 


16.126 






HCRP-HMM SGibbs 


133.31 


379.833 


7.027 


0.999861 




HCRP-HMM SGibbs+SM=130 


131.66 


386.833 


7.664 


0.999857 


0.000052 


HCRP-HMM SGibbs+SM=5400 


135.94 


336.278 


31.751 


0.999880 


0.000050 


HCRP-HMM SSlice 


131.17 


422.000 


0.469 


0.999986 




HCRP-HMM SSlice+SM=130 


131.67 


409.833 


0.976 


0.999993 


0.000052 


HCRP-HMM SSlice+SM=5400 


152.76 


254.722 


36.617 


0.999993 


0.000056 


HCRP-HMM BGibbs 


199.14 


1840.833 


29380.261 


0.992748 




HCRP-HMM Beam 


141.77 


603.333 


80.681 


0.995627 




HCRP-HMM Beam+SM=130 


142.69 


567.278 


72.217 


0.995612 


0.000124 


HCRP-HMM Beam+SM=5400 


141.07 


495.667 


84.925 


0.995554 


0.000101 



For HCRP-HMM, the block size « 10. 



A Miscellaneous Algorithms 



Algorithm 2 getProb(j, k, s): Calculate p(xji = k\s) 
if m.fc = then 
return 



else 

return 
end if 



nj.k+aa- 



■ fc+ - 



■Tlj.. +a a 



Algorithm 3 addCustomer(j, k, s old ): Adds new cus- 
tomer eating dish k to restaurant j. 

s := s old 

With probabilities proportional to: rijtk (t = 
1, . . . , rrij.): Increment rijtk (the customer sits at i-th 
table) Qo T7 ["^ 7 : sit customer at a new table t new serv- 
ing dish k in restaurant j (rij^^k '■= 1, fcjt»«™ := fc, 
increment rrijk) 
return updated s 



Algorithm 4 removeCustomer(j, k, s old ): Removes ex- 
isting customer eating dish k from restaurant j. 



Sample t,, in proportional to rij t t k 
Decrement nj t i k (the customer at tji-tii table is re- 
moved) 

if njt ik becomes zero then 

Remove the unoccupied table tji from restaurant j, 

decrement rrijk 
end if 

return updated s 



B Step-wise Gibbs sampler 

To manipulate emission probability F(xi) with a conjugate 
prior, we introduced a similar notation to HCRP, which can 
be intuitively understood. 



Algorithm 5 Step-wise Gibbs sweep for HCRP-HMM 
Input: yi, . , . ,yf. observed emissions 

Xi, . . . , xf. previously inferred states 
g oid. set Q f qrp seating arrangements 
F : set of emission distributions 
for i = 1 , . . . , T, in random order do 
Si = removeCustomer(xi+i, xt, s old ) 
r oid _ g etProb(o;^ + i,a; t ,si) 
Fa = removeCustomer(?/i, Xi, F old ) 



~old 



getProb^^x^Si) 



50 = removeCustomer(xi, a?<_i, s{) 
rf d = getProb(x l , x^ u s ) 
Sample x* in proportion to q(x t ) where 

■p(yt\F xt )-p(x t+1 \si Xt) ) 

r{ = getProb(a; t *,x t _i,So) 

51 = addCustomerfxt, Xt-i, so) 
r| = getProb(y t , x u F ) 
F* = addCustomer(y 4 , xt, Fo) 

= getProb(ar i+ i,a£,si) 
s* = addCustomer(a;t+i, x\ , si) 
r i r *2 r 3 q(xt) 



(xt-i)- 



R := min ( 1, 

(x t ,s,F) 
end for 



„old ^old ^old 



(X*,S*,F* 



with probability R 



(x t ,s old ,F old ) otherwise 



C Blocked Gibbs sampler 

For details on the ForwardBackwardSampling routine, 
please refer to the literature [?]. 

Algorithm 6 removeSeq(i , i\, x, s old , F old ): remove 
customers for a part of state sequence (x^ , . . . , x^) 

L = i\ — io — 1 

sl = removeCustomer(a;i 0+ i, x ix , s old ) 
rf d =p{x ji = k\s) 
F L = F old 

for I = L — 1 downto do 

s e = removeCustomer(a; l6+ £,a; l6+ f_i,S£ + i) 
F e = removeCustomer(y lb+£ ,a; lb+ £,F f+1 ) 
rf d = g&t?rob{x ib+i ,x ib+ ^ 1 ,sg) ■ 

getProb(y ib+e ,x ib+ t,F £ ) 
end for 



return (s , F , ]] e=0 r° 



id\ 



Algorithm 7 addSeq(io, ii,x, s 0l F ): add customers for 
a part of state sequence (x^ , . . . , x^) 

L = i\ — io — 1 

for I = to L 1 do 

r| = getProb(x, 6+ f,x, 6+ £_i,Sf) 

g&Prob(y lb+ i,x* b+i ,Fi) 
s e+ i = addCustomer(a;* i)+ ^,x* ii+£ _ 1 ,S£) 
F e+ i = addCustomer(y l6+f , x* b+e , Fg) 
end for 

r* L = getProb(a; Jl ,.T*_ 1 ,s| / ) 

s* = addCustomer(a;j 1 -|-i,,a;* 1 _ 1 , s£); F* = F£ 

return (s* , F* L ,U^ =0 r* e ) 



Algorithm 8 Blocked Gibbs sweep for HCRP-HMM 
Input: j/i, . . . observed emissions 

x = xi, . . . , xt- previously inferred states 

s: set of CRP seating arrangements 

F: set of emission distributions 

B : number of blocks 
Choose block boundaries i±, , . , , ib-i € {2, . . . , T}; 
io := 1, i B = T 

for b = 0, . . . , B — 1, in random order do 

(s , F , r old ) = removeSeq(i b , i b+1 - i b ,x, s, F, 0); 

x* = X{ for all t < % or t > tb+i 



-l) = 



FBSampler(7r| So , F , y ib :i b+L -i, x ib -i, x ib+L ) 
Calculate q old = q(x ib , ... ,x ib+ L-i) and q* = 
q(x* b , . . . ,x* b+L _ l ) 
gold = CRP( 7j F) 
Q* = CRP(7,JJ) 

if Xt b+1 refers to a new state in So then 

Q* := addCustomer(x ti)+1 , Q*) 

Q old := addCustomer(a; t6+1 ,g oW ) 
end if 

for t = tb to tb+i — 1 do 

if Xi refers to a new state in Sq then 



^old 



^old 



q-- *getProb(a; I ,(5 oW ) 
Q oW := addCustomer(a; l ,Q oW ) 
end if 

if x* = k new then 

sample s ~ Q* ; x* := s 

:= *getProb(a;*,g*) 
Q* := addCustomer(a;i, Q*) 
end if 
end for 

•So = So; Fq = F 

(s*,F*,r* = addSeq(io, L, x, s ,F ) 



n old 

R := mini 1, r old ■ r* 

q* 



(x*,s*,F*) 



with probability R 



(x, s,F) . . ^ old ^ pold ^ Qtherwise 



end for 



D Split-Merge sampler 



Algorithm 9 Split-Merge Sampler for an HCRP-HMM 
Input: 2/1, ... , ut- observed emissions 

xi , . . . , xt'- previously inferred states 

s oid. set Q £ qrp seating arrangements 

F old : set of emission distributions 
Rthr _ Uniform(0, 1) 
Choose distinct ti,t 2 <E {1, . . . , T} 

Identify all fragments (pi, ei) s.t. for all t e (6,, . . . , e*), € {x tl ,x t . 2 } At ^ {ii, £2}, and not contained in other 
fragments 

Permute fragments randomly 
Let U be the number of fragments 

„old TP jTiold 

Srj+l — S , t j/+l — r 

if = xt 2 then 

{ Try split move } 

for i = U downto 1 do 

(si,F t ,rf d ) = removeSeq(b t , e.i, x, s i+1 , F i+1 ) 

end for 

x* t2 = new k index 
else 

{ Try merge move } 

for i = U downto 1 do 

(si,Fi,rf d ) = removeSeq(.T 6i:ei ,s i+ i,F i+ i) 

old _ SeqProb(7r|s i+1 , Fi,y bi . ei ,x bt -i :et+1 ) 

1 ForwardProb(7r| Si+1 , Fi,y bi:ei , x bi _i,x ei+1 ; {x tl , x t2 }) 
end for 

x t 2 ~ x ti 
end if 

{ Remove customers that accounts for transitions around xf d } 

F = removeCustomer(j/ t2 ,x t2 ,Fi) 

pold=p(y t2 \ Fxt2 ) 

s := Si 

if £2 — 1 is not in any fragment then 

So := removeCustomer(a;( 2 , Xt 2 -i, «o)) 

rg id * = getProb(x t2 ,a; t2 _i,Si)) 
end if 

if f 2 + 1 is not in any fragment then 

So := removeCustomer(.r t2 + l,a; t2 ,So)) 

rg w * = getProb(a- t2+ ix t2 ,s )) 
end if 

Qo d = Qo = 1 

(continue to Algorithm 10) 



Algorithm 10 Split-Merge Sampler for an HCRP-HMM (continued) 



{ Add customers that accounts for transitions around x£ } 

p* =p(yt 2 \KiJ 

F* = addCustomer(?/ t2 ,a; t * 2 ,Fg) 

s* := s 

if t2 + 1 is not in any fragment then 

r^* = getProb(x t2 +iX t2 , s*)) 

:= addCustomer(ir t2 + 1, Xt 2 , s*)) 
end if 

if t-2 — 1 is not in any fragment then 

r-(5* = getProb(x tl \x t2 -i, sf)) 
s| := addCustomer(a; t2 , xt 2 — 1, sf)) 
end if 

if xt 1 = xt 2 then 

{ Try split move } 

for i = 1 to U do 

SeqProb(7r| s?+i , F*,y bz]ez , ^._ 1:e . +1 ) 

9* 



Forward Prob(7r| s . +i , F*,y bv . et , x* b ._ v x* e . +1 ; {x* ti , a; t * 2 }) 
(s* +2 ,F* +1 ,r*) = addSeq(6 J ,e l ,a;*,s* +1 ,F,*) 
end for 
else 



for i = 1 to U do 

* oid 

rjcur _ T~~\ L ~ — TT -1 

— lli'=0 q * ' lii=0 r oid 

if R thr > R cur then 

rejection determined, exit loop 
end if 

q* = i 

(s* , F*_ 1: r*) = addSeqO, ei , x, s i+1 ,Fi) 
end for 
end if 

* old 

u - rt 1 I±- rr' 9i 

ll l= „,old ' f li=l 



{ Try merge move } 



(x,a,F) 



It 

x*,s*,F*) R thr <R 
x old , s old , F old ) otherwise 



